
[CoreML EP] Add Identity, Ceil, Tile builders + drop trivial-only partitions #28293

Open
maxwbuckley wants to merge 5 commits into microsoft:main from maxwbuckley:coreml-tile-ceil-identity
Contributor

@maxwbuckley maxwbuckley commented Apr 30, 2026

Description

Adds three MLProgram op builders (Identity, Ceil, Tile) to the CoreML EP and a partition-quality heuristic that drops CoreML partitions consisting entirely of trivial shape / cheap-elementwise ops.

No tracking issue; discovered via YOLOv10 partitioning analysis on Apple Silicon.

Empirical impact

YOLOv10n, M3 Max, MLProgram, batch 1, 1500-iter pooled:

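|                    | Partitions | Mean (ms) | StdDev (ms) | P99 (ms) |
|--------------------|-----------:|----------:|------------:|---------:|
| Without this patch |          4 |     3.798 |       0.867 |    6.608 |
| With this patch    |          3 |     3.403 |       0.636 |    5.957 |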

10.4% lower mean latency, 26% lower standard deviation.

Why both pieces are coupled

Adding the builders alone is net-negative on graphs where these ops sit in isolated chains. Per-op CoreML dispatch overhead on M3 Max (32-op chains on a 1×64×56×56 fp32 tensor, n=2997, MLProgram):

| op       | CPU EP per op | CoreML EP per op |
|----------|--------------:|-----------------:|
| Identity |         <1 µs |           ~14 µs |
| Ceil     |         ~6 µs |           ~12 µs |
| Tile     |        ~10 µs |           ~10 µs |

A trivial-only partition pays ~50-100 µs of round-trip marshalling plus ~10 µs of CoreML dispatch per op, versus <1 µs to ~10 µs per op on CPU (see the table above). These ops are worth claiming only when they are sandwiched between compute-heavy ops, where the round trip is already paid for. The heuristic enforces that.
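The trade-off can be made concrete with a back-of-the-envelope cost model. This is only a sketch: the function names are hypothetical (not ORT APIs), and the constants plugged into it are the rough figures quoted in this description rather than anything this code measures.

```cpp
#include <cassert>

// Hypothetical cost model for illustration; all times are in microseconds.

// Estimated cost of running `num_ops` trivial ops as their own CoreML
// partition: one CPU -> CoreML -> CPU round trip plus per-op dispatch.
double IsolatedCoreMLCostUs(int num_ops, double round_trip_us,
                            double dispatch_us_per_op) {
  assert(num_ops >= 0);
  return round_trip_us + num_ops * dispatch_us_per_op;
}

// Estimated cost of running the same ops on the CPU EP.
double CpuCostUs(int num_ops, double cpu_us_per_op) {
  assert(num_ops >= 0);
  return num_ops * cpu_us_per_op;
}
```

Taking the low end of the quoted round trip (50 µs) and ~10 µs per op on both sides, a 3-op Tile chain comes to about 80 µs as an isolated CoreML partition versus about 30 µs on CPU. Inside an existing CoreML partition the round-trip term is already paid, which is the case the heuristic preserves.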

Implementation

New op builders.

  • Identity emits MIL identity (the NeuralNetwork path uses LINEAR(α=1, β=0)).
  • Ceil joins the existing unary chain in UnaryOpBuilder.
  • Tile emits MIL tile. It overrides HasSupportedInputsImpl to additionally accept INT32/INT64/BOOL: Tile is shape-only data movement, and the default float-only filter rejected it on common YOLO grid-index post-processing. It also accepts a runtime reps tensor in addition to a constant initializer.
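To illustrate why Tile is described as shape-only data movement, here is a minimal reference sketch of ONNX Tile semantics on a row-major buffer. It is not the builder code from this PR; `TiledShape` and `TileRef` are hypothetical names, and the template parameter stands in for any element type (float, int32, int64, bool).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Output shape of Tile: each dimension is multiplied by its repeat count.
std::vector<int64_t> TiledShape(const std::vector<int64_t>& shape,
                                const std::vector<int64_t>& reps) {
  assert(shape.size() == reps.size());
  std::vector<int64_t> out(shape.size());
  for (size_t i = 0; i < shape.size(); ++i) out[i] = shape[i] * reps[i];
  return out;
}

// Reference Tile over a row-major tensor. Elements are only copied, never
// computed on, so the element type T is irrelevant to the logic.
template <typename T>
std::vector<T> TileRef(const std::vector<T>& data,
                       const std::vector<int64_t>& shape,
                       const std::vector<int64_t>& reps) {
  const std::vector<int64_t> out_shape = TiledShape(shape, reps);
  int64_t out_size = 1;
  for (int64_t d : out_shape) out_size *= d;

  std::vector<T> out(static_cast<size_t>(out_size));
  for (int64_t flat = 0; flat < out_size; ++flat) {
    // Decode the output coordinate from the last dimension to the first,
    // wrap it into the input extent, and accumulate the source offset.
    int64_t rem = flat, src = 0, stride = 1;
    for (size_t i = shape.size(); i-- > 0;) {
      const int64_t coord = rem % out_shape[i];
      rem /= out_shape[i];
      src += (coord % shape[i]) * stride;
      stride *= shape[i];
    }
    out[static_cast<size_t>(flat)] = data[static_cast<size_t>(src)];
  }
  return out;
}
```

For example, tiling a 1×2 int32 tensor `{1, 2}` with reps `{2, 2}` yields a 2×4 tensor whose rows are both `1, 2, 1, 2`.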

Heuristic. CoreMLExecutionProvider::GetCapability now uses the callback-taking overload of CreateSupportedPartitions (same as NNAPI EP). A partition is kept iff at least one node is outside the trivial set:

{Identity, Cast, Reshape, Squeeze, Unsqueeze, Flatten, Transpose, Tile, Ceil}

This lets the new builders absorb mid-chain trivial ops into existing CoreML partitions (the win) without claiming isolated trivial chains that would force a needless CPU→CoreML→CPU detour (the regression).
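The keep/drop rule can be sketched as a predicate over the op types in a candidate partition. This is a simplified illustration, not the PR's code: the actual change operates on graph nodes through the callback passed to CreateSupportedPartitions, while `ShouldKeepPartition` below is a hypothetical helper that takes plain op-type strings.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Op types treated as trivial, as listed in the description above.
const std::unordered_set<std::string> kTrivialOpTypes = {
    "Identity", "Cast",      "Reshape", "Squeeze", "Unsqueeze",
    "Flatten",  "Transpose", "Tile",    "Ceil"};

// Hypothetical helper: a partition is kept iff at least one of its ops is
// outside the trivial set. An empty partition is therefore dropped.
bool ShouldKeepPartition(const std::vector<std::string>& op_types) {
  return std::any_of(op_types.begin(), op_types.end(),
                     [](const std::string& op) {
                       return kTrivialOpTypes.count(op) == 0;
                     });
}
```

Under this rule, Conv → Identity → Cast → Reshape → Conv is kept because of the Conv nodes, while Identity → Cast → Reshape on its own is dropped and falls back to CPU.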

Tests

coreml_basic_test.cc covers both halves of the heuristic.

Builder coverage (compute anchor present → claimed):

  • IdentityWithConvAnchor, CeilWithConvAnchor, TileWithConvAnchor

Heuristic coverage:

  • ConvTrivialChainConvKeepsAllOnCoreML — Conv → Identity → Cast → Reshape → Conv stays in a single CoreML partition
  • TrivialOnlyChainIsNotClaimedByCoreML, ReshapeOnlyChainIsNotClaimedByCoreML, TransposeOnlyChainIsNotClaimedByCoreML, TileOnlyIsNotClaimedByCoreML, CeilOnlyIsNotClaimedByCoreML, MixedTrivialChainIsNotClaimedByCoreML — each falls back to CPU

Trivial-only tests pin graph_optimization_level = Default so that passes like IdentityElimination / CastElimination cannot pre-empt the heuristic; what is exercised is GetCapability itself.

All 28 CoreML EP tests pass locally on macOS 26.3 / M3 Max.

🤖 Generated with Claude Code

maxwbuckley and others added 5 commits April 30, 2026 13:43
…titions

10.4% mean speedup and 26% stddev tightening on YOLOv10n (M3 Max,
MLProgram, batch 1, 1500 iterations pooled), with no regression on
ResNet50 (which contains no Identity/Ceil/Tile).

|                            | Partitions | Mean    | StdDev | P99    |
|----------------------------|-----------:|--------:|-------:|-------:|
| Without builders           |          4 | 3.798ms |  0.867 | 6.608  |
| With builders + heuristic  |          3 | 3.403ms |  0.636 | 5.957  |

No tracking issue; discovered via YOLOv10 partitioning analysis.

== What changes ==

Three new MLProgram op builders: Identity, Ceil, Tile. Tile additionally
accepts INT32/INT64/BOOL inputs (it is shape-only data movement; the
default float-only filter rejected it on common YOLO grid-index
post-processing) and accepts a runtime 'reps' tensor, not only a
constant initializer.

A partition-quality heuristic in CoreMLExecutionProvider::GetCapability
that drops partitions whose nodes are all in {Identity, Cast, Reshape,
Squeeze, Unsqueeze, Flatten, Transpose, Tile, Ceil}. The heuristic uses
the callback-taking overload of CreateSupportedPartitions (same as
NNAPI EP); a partition is kept iff at least one node is outside the
trivial set.

== Why both ==

Op coverage and the heuristic are coupled: adding the builders alone is
net-negative on graphs where these ops sit in isolated chains.
Per-op CoreML dispatch overhead on M3 Max (32-op chains on a
1x64x56x56 fp32 tensor, n=2997, MLProgram):

| op       | CPU EP per op | CoreML EP per op |
|----------|--------------:|-----------------:|
| Identity |        <1 us  |          ~14 us  |
| Ceil     |         ~6 us |          ~12 us  |
| Tile     |        ~10 us |          ~10 us  |

A trivial-only partition pays ~50-100us round-trip marshalling plus
~10us per op of CoreML dispatch, vs <1us to ~10us per op on CPU (table
above). Worth claiming only when sandwiched between compute-heavy ops,
where the round-trip is already paid for. The heuristic enforces that.

== Tests ==

coreml_basic_test.cc covers both halves of the heuristic: builder
coverage with Conv anchors (claimed), and heuristic coverage with one
sandwich case (claimed) plus six trivial-only chains across different
op types (dropped). Trivial-only tests pin graph_optimization_level=
Default so passes like IdentityElimination cannot pre-empt the
heuristic; what's exercised is GetCapability itself.

All pass locally on macOS 26.3 / M3 Max.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves conflict in coreml_basic_test.cc where this branch's
tile/ceil/identity helpers + tests landed in the same file region as
the Split11/13/7 tests merged via microsoft#28270. Both sets are preserved
sequentially.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clang-format flagged a double blank line between the QuickGeluTestFp16
test and the tile/ceil/identity helper namespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…IsTrivial

Replaces the hand-maintained kTrivialOpTypes set in
CoreMLExecutionProvider::GetCapability with a virtual
IOpBuilder::IsTrivial(node) method overridden by each trivial builder
(Identity, Cast, Reshape, Squeeze, Flatten, Transpose, Tile, plus Ceil
inside UnaryOpBuilder).

The marker now lives next to each builder's other classification methods,
so adding a new trivial builder won't drift from the partition heuristic,
and a UnaryOpBuilder-style multi-op-type builder can answer per node.
The trivial-partition heuristic itself is unchanged; the eight existing
*Chain*/*Anchor* tests pass without modification.
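
A rough sketch of the shape this refactor describes, using minimal stand-in
types. The class names mirror the ones mentioned above, but the Node struct,
the method signature, and the default return value are assumptions made for
illustration, not the actual ORT declarations.

```cpp
#include <string>

// Minimal stand-in for a graph node; the real ORT Node is far richer.
struct Node {
  std::string op_type;
};

// Stand-in for the builder interface. Assumed default: an op is
// non-trivial unless its builder says otherwise.
class IOpBuilder {
 public:
  virtual ~IOpBuilder() = default;
  virtual bool IsTrivial(const Node& /*node*/) const { return false; }
};

// A single-op builder marks every node it handles as trivial.
class IdentityOpBuilder : public IOpBuilder {
 public:
  bool IsTrivial(const Node& /*node*/) const override { return true; }
};

// A multi-op builder answers per node: of the unary ops it handles,
// only Ceil counts as trivial in this sketch.
class UnaryOpBuilder : public IOpBuilder {
 public:
  bool IsTrivial(const Node& node) const override {
    return node.op_type == "Ceil";
  }
};
```

With this layout the partition check asks each node's builder instead of
consulting a separate set, so the classification cannot drift from the
builder that owns it.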

The Cast override carries a one-line note explaining why it counts as
trivial (marshalling-cost-dominated for small tensors) since it's the
least obviously cheap op in the set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tity

# Conflicts:
#	onnxruntime/test/providers/coreml/coreml_basic_test.cc
@maxwbuckley
Contributor Author

@yuslepukhin last one 🥳

Member

@yuslepukhin yuslepukhin left a comment


Some minor comments.

Coverage gaps:

  • No test for Tile with non-unit repeats (e.g., reps=[1,1,2,2]) — would be good to verify actual tiling works on macOS.
  • No test for Identity/Tile on non-float types (int32, bool) — though the builders' type restrictions make this less relevant.
  • No standalone test for the NeuralNetwork path of Identity or Tile builders (only MLProgram is tested). The NeuralNetwork path exists in both builders, but RunConvChainTest uses MakeCoreMLExecutionProvider("MLProgram").
  • The tile_with_repeats parameter in MakeConvWithTrivialChainModel is accepted but unused ((void)tile_with_repeats;). This is dead code — it was likely meant to support a variant test but was never wired up.

node->add_input("conv_out");
node->add_input("reps");
node->add_output("Y");
(void)tile_with_repeats;
Member


No C-style casts. Use the ORT_UNUSED_* macros instead.
